Dissertation Memo

Discussion of Work

Author

Brian Key

Published

February 25, 2026

Github Repo Link

1 Introduction

1.1 Purpose

Since defending my prospectus, I’ve developed, deployed, and refined a raft of models that carry out my express research agenda. The object of this memo is to describe these models and both illustrate and comment on my central findings.

1.2 Research Questions

Before proceeding, it may be useful to recall the contents of my research agenda. To this end, below are concise formulations of the questions animating each chapter (in order):

  1. Do human rights conditions in preferential trade agreements (PTAs), particularly when codified as more legalized (i.e., “harder”), improve human rights respect among signatories?
  2. Do bilateral investment treaties (BITs) improve human rights respect among signatories?
  3. Do economic sanctions sent by Western states improve human rights respect among non-Western targets?

1.3 Key Findings

As we shall see, my models (as presently construed) suggest the following general answers to my research questions:

  1. “Harder” human rights conditions in PTAs are associated with improved human rights respect in less-democratic states but a worsening of such respect in more-democratic ones.
  2. Likewise, BITs are associated with improved human rights respect in less-democratic states but a worsening of such respect in more-democratic ones.
  3. Conversely, economic sanctions against non-Western states are associated with worsened human rights respect in less-democratic targets but an improvement of such respect in more-democratic ones.

1.4 Overview of Memo

The remainder of my memo is organized as follows:

  1. A description of my general methods for the dissertation writ large (Section 2).
  2. Discussions and illustrations of my work and findings for each chapter (Section 3, Section 4, Section 5).
  3. A conclusion, wherein I contemplate remaining questions and future work (Section 6).
  4. Appendices (Section 8).

2 The General Methodology

2.1 Summary

The following section delineates the overarching methodology common to my work in all three chapters:

  1. Replication (Section 2.2)
  2. Preprocessing (Section 2.3)
    1. Data wrangling (Section 2.3.1)
    2. Multiple imputation (Section 2.3.2)
    3. Spatial lagging (Section 2.3.3)
    4. Start-Year Specification (Section 2.3.4)
    5. Temporal lagging (Section 2.3.5)
  3. Double/debiased machine learning (Section 2.4)
  4. Pooling (Section 2.4)

2.2 Replication

In each chapter, I first run models that approximately replicate those of the literature inspiring my work, the main points of departure (where appropriate) being (1.) the substitution of HR Scores for the authors’ outcomes and my selected treatments for theirs, and (2.) the use of both unimputed (in line with the authors’ methodologies) and multiply-imputed data. I do so to test whether the introduction of new key variables and/or multiple imputation immediately yields divergent results. More on these replications can be found in Section 3.3, Section 4.3, and Section 5.3.

2.3 Preprocessing

2.3.1 Data Wrangling

2.3.1.1 Treatments

Before running the replication and novel models, I assemble a dataset, \(\mathbf{S} = [\mathbf{Y \; D \; X}] \in \mathbb{R}^{N \times (1 + q + k)}\), through data wrangling. The first (and perhaps most involved) step in this process is the construction of the treatment variables, \(\mathbf{D} \in \mathbb{R}^{N \times q}\), a task which generally demands significant transformations to the raw data to ultimately render them in country-year (i.e., panel) format. How these treatments are sourced and generated—and how missingness is handled—is discussed further in Section 3.1, Section 4.1, and Section 5.1.

2.3.1.2 Covariates

Afterwards, I consolidate the outcome, \(\mathbf{Y} \in \mathbb{R}^{N \times 1}\), and covariates, \(\mathbf{X} \in \mathbb{R}^{N \times k}\). I also create additional covariates where necessary. \(\mathbf{X}\) is nearly identical across the three chapters, featuring the following common covariates:

  1. All 26 V-Dem high-level and mid-level indices.
  2. From the Polity Project, Polity Score and Regime Durability.1
  3. World Development Indicators (World Bank) pertaining to:2
    1. Trade.3
    2. Foreign direct investment (FDI).4
    3. Population density.
  4. From Fariss et al. (2022), latent estimates of GDP, GDPpc, and population (logged).
  5. Balance of payments.5
  • 1 The proximate sources of these variables are V-Dem and the Quality of Governance (QoG) dataset, respectively.

  • 2 These variables are sourced from the QoG.

  • 3 Sum of exports and imports of goods and services, % of country-year GDP.

  • 4 Net inflows and outflows, % of country-year GDP.

  • 5 % of country-year GDP.

  • Justifications for the inclusion of these covariates, as well as exceptions to the common elements of \(\mathbf{S}\), are present in Section 3.2, Section 4.2, and Section 5.2.

    2.3.1.3 Case Selection & Coding

    During the assembly of \(\mathbf{S}\), I make several decisions on handling observations featuring missingness in numerous or critical areas and irregular coding attributes. Most of these cases involve (1.) small-nation status, (2.) partial international recognition, and (3.) post-Cold War transitions.

    In the first instance, I omit from \(\mathbf{S}\) all countries failing to appear in V-Dem, coverage in HR Scores notwithstanding. Constituting the lion’s share of these cases are microstates, as classified by the Correlates of War (COW) project. Though this omission comes at the cost of circumscribing my studies’ external validity—namely, to non-microstates—it likely improves the robustness of my final inferences vis-à-vis future replications. Indeed, V-Dem indices constitute the majority of dimensions in each chapter’s \(\mathbf{X}\), meaning conclusions drawn from imputed V-Dem data for microstates would be generally and greatly sensitive to the randomness immanent in the imputation process.6

  • 6 More on my missingness handling method—multiple imputation—may be found in Section 2.3.2.

  • 7 A discussion of my spatial models is to be found in Section 2.3.3.

  • 8 For a discussion of Palestine’s exclusion, see the list’s attendant case description.

  • 9 Only states with accompanying Gleditsch and Ward IDs (i.e., “state numbers”) appear in Fariss et al.’s (2022) data.

  • 10 The computation of these treatments is outlined in Section 3.1 and Section 4.1.

  • Moreover, I omit Palestine in view of problems originating from inconsistent coding patterns. Unlike the foregoing microstates, Palestine is covered in V-Dem. However, its reported scores are disaggregated for much of the post-World War II period, with Gaza and the West Bank treated as mutually-exclusive observations, making it unclear as to how to reconcile these data with the single spatial unit that the source of my spatial data, Natural Earth, supplies for Palestine.7 Equally important, Palestine—absent from Gleditsch & Ward’s List of Independent States (v7) due to its contested international recognition8—lacks the prerequisite for coverage in Fariss et al.’s (2022) GDP, GDPpc, and population variables,9 estimates essential for computing several treatments pre-imputation.10 Owing to these theoretical and logistical roadblocks, I abstain from incorporating Palestine into my models, with the understanding that doing so comes at some inferential cost.

    Included in some of my models, on the other hand, are a few extinct countries, namely East Germany (formally the German Democratic Republic) and South Yemen (formally the People’s Democratic Republic of Yemen). Each was a communist state that unified with its non-communist counterpart (West Germany and North Yemen, respectively) in 1990, amidst the denouement of the Cold War. The two countries enjoy complete data for HR Scores, Fariss et al.’s (2022) estimates, and virtually all V-Dem covariates, inter alia—yet they crucially lack polygons from Natural Earth, which does not offer data on historical geographical boundaries on which to base spatial lags. In virtue of data availability, I opt to integrate East Germany and South Yemen into my non-spatial models, though this inevitably entails a greater number of cases in such models relative to that of my spatial ones. Always marginal (\(\Delta = 2\)), this difference is only present in models with a pre-1990 start year,11 before the two countries naturally “drop out” of the dataset.

  • 11 Start years are discussed in greater detail in Section 2.3.4.

  • A further consideration involving post-Cold War geopolitical reconfigurations is the existence of several discrepancies in case identification—namely, differences across datasets in the coding of continuities between predecessor and successor states. These present challenges for variable collation and subsequent inference; indeed, when joining datasets on country identifiers (e.g., COW codes), the discrepancies may yield unmerited duplicates or missingness, raising doubts over inferential validity in later modelling stages. To harmonize all inputs of \(\mathbf{S}\), I allow decisions adopted by HR Scores and/or V-Dem—the sources of my outcome and the preponderance of covariates—to override alternative coding schemes in the following cases:

    1. Czechia: coded as the successor of Czechoslovakia.
    2. Germany: coded as the successor of West Germany.
    3. Serbia: coded as the successor of Yugoslavia.
    4. Yemen: coded as the successor of North Yemen.

    These decisions being taken, I arrive at my final set of cases. In general, the maximum number of countries featuring in the two-way fixed effects models is 176, whereas the equivalent for the spatial models is 174. For a full breakdown on these figures, as well as complete itemizations of the excluded and manually-coded cases discussed above, see appendices A (Section 8.1), B (Section 8.2), and C (Section 8.3).

    2.3.2 Multiple Imputation

    The aim of my next step is to handle remaining missingness in \(\mathbf{S}\). Of course, simple methods of doing so exist (e.g., listwise deletion, mean imputation, etc.); but these generally rest on the very strong assumption that the data is “missing completely at random” (MCAR), produce unwarrantedly large or small standard errors, or may prove wasteful.12 Endeavoring to avoid these pitfalls, I opt for the more robust method of multiple imputation. Widely regarded as “the best general method to deal with incomplete data” (Van Buuren, 2018), multiple imputation broadly involves the following three steps:

  • 12 Indeed, listwise deletion would result in the excision of about a third of all my observations per model, as suggested by the data presented in Appendix D (Figure 8). For more on “ad-hoc” solutions to handling missing data, see Van Buuren (2018).

    1. From an incomplete dataset, generating \(m > 1\) complete versions thereof.
    2. With each version, running select models to estimate some parameter(s) of interest.
    3. Under “Rubin’s rules,” pooling these estimates to arrive at a single estimate and accompanying variance estimate, thus allowing for causal inference.13
  • 13 For more on the basics of multiple imputation, see Van Buuren (2018).

  • 14 For more on MCAR, MAR, and MNAR, see Van Buuren (2018).

  • Through this process, multiple imputation provides more realistic standard errors and eschews data wasting. It also need not assume MCAR. Specifically, multiple imputation is equipped to handle data that are both “missing at random” (MAR) and “missing not at random” (MNAR) (Van Buuren, 2018). MAR prevails when MCAR may be assumed conditional on knowing what factor(s) explain the missingness at issue, and thus rests on safer footing than MCAR alone. MNAR, meanwhile, exists when such explanatory factors are unknown.14 Carrying out multiple imputation on MNAR data generally requires researchers to assemble more variables explaining the missingness, so as to establish MAR, and/or to conduct an array of sensitivity analyses (Van Buuren, 2018). Lacking obvious evidence of MNAR (particularly after removing the cases described in Section 2.3.1), and guided by the advice that “[t]he MAR assumption is often a suitable starting point” (Van Buuren, 2018), I proceed naively presupposing MAR conditions.

    In all three chapters, I select \(m = 5\). Von Hippel (2009) and White et al. (2011) jointly suggest the following “rule of thumb” when setting \(m\): “the number of imputations should be similar to the percentage of cases that are incomplete” (Van Buuren, 2018). For each chapter and start year, the proportion of observations featuring a missing value is about 30%, suggesting an \(m\) in the vicinity of 30. Undercutting this rule of thumb, however, are a few practical difficulties, as Van Buuren observes. First, the recommended value of \(m\) is likely to be large for datasets with large numbers of variables, and indeed can only increase as dimensions are added, since observations are likelier to feature at least one missing value the more variables they possess. Second—and relatedly—increasing \(m\) correspondingly increases the computational and storage costs of the analysis, as each model must be fit on each \(\mathbf{m}_{i}\). Van Buuren (2018) instead recommends setting \(m\) to the average missing data rate (i.e., the proportion of values missing from the dataset), or to 5 when computational and storage capabilities are more limited. In fact, the average missing data rate for each chapter and start year is approximately 2%—even when excluding variables that are complete by design, namely the outcome, treatments, and fixed effects—meaning \(m = 5\) is more than sufficient under the circumstances. For a complete summary of these missing data rates, see Appendix D (Section 8.4).

    To create the imputed data, I avail myself of mice::futuremice, which permits users to generate each \(\mathbf{m}_{i}\) in parallel across individual CPU cores, significantly quickening computation time. In addition to setting \(m\),15 I further specify the imputation method—random forests, equipped to handle both numeric and categorical data (Van Buuren, 2018)—and the predictor variables.16 Finally, I set a seed to ensure reproducibility of the imputations. Implementing the futuremice with these arguments ultimately yields five completed iterations of \(\mathbf{S}\): \((\mathbf{m}_{i})_{i=1}^5\). It is from these iterations that I conduct all remaining work, pooling the results obtained from each \(\mathbf{m}_{i}\) in the final stage.17

  • 15 Technically, this is done by tolerating the default setting \(m = 5\).

  • 16 For more on the process of and algorithm for tree-based imputation, see (Van Buuren, 2018)

  • 17 For more on pooling, see Section 2.5.

  • 2.3.3 Spatial Lagging

    In each chapter, I run two sets of models: one predicated on traditional two-way fixed effects, the other on spatial econometrics. My preliminary plans for models drawing from this tradition are discussed in my prospectus, specifically as regards chapters 2 and 3; but at this juncture, it’s imperative to elaborate not only on why I’ve instead implemented spatial models for all three chapters, but also on the specific spatial model I’ve chosen.

    The cardinal aim of spatial-econometric models is to enable social scientists to control for spatial dependencies between observations in geographically-structured data, where traits of unit \(x\) may correlate with or affect those of neighboring unit \(y\) and perhaps vice versa. In such contexts, these dependencies may jeopardize valid causal inference in undermining the validity of independent and identical distribution (IID), an assumption undergirding linear regression models (Rüttenauer, 2022, p. 729). The data at my disposal—country-year panel data—manifestly feature a geographical dynamic, with some observations being spatially nearer to or farther away from others. In addition, spatial dependencies, irrespective of the mechanism(s) producing them,18 have been observed in country-year data in sundry dimensions. These include poverty, inequality, corruption, pollution, levels of democracy and wealth, and policy emulation more generally (Ward & Gleditsch, 2008, pp. 2-11).19 The ubiquity of spatial dependencies in country-year data and the hazard they pose to valid inference jointly merit, in my view, the application of spatial models wherever such data is of interest—that is, throughout my dissertation—at least as a supplement to traditional, non-spatial methods.

  • 18 These may include spillovers or clustering (whether explained or unexplained) of outcomes, covariates, or unobservables (i.e., the error term). For more on these potential mechanisms, see Cook et al. (2023, p. 62).

  • 19 For additional examples of spatial dependencies in country-year data, see Cook et al. (2023, p. 62) and Wimpy et al. (2021, p. 723).

  • 20 For more on the popularity of SAR, see Rüttenauer (2024, p. 7).

  • Ultimately, I select the spatially-lagged X (SLX) model over other candidates. I do so motivated by recent literature on spatial econometrics, including Rüttenauer (2024; 2022) and Wimpy et al. (2021), which foregrounds the advantages SLX enjoys over its counterparts—particularly the spatial autoregressive (SAR) model, overwhelmingly the most common implementation of spatial lags.20

    The SAR model, which incorporates an endogenous, spatially-lagged form of the outcome (\(y\)) as a covariate, gives way by design to system-wide “global effects”: effects beyond first-order (i.e., nearest) neighbors, where an original unit’s \(y\) affects not only its first-order neighbor’s \(y\), but also the neighbor of the first-order neighbor’s \(y\) via the latter (and so on); and feedback effects, where changes in \(y\) in both high- and first-order neighbors return to affect the original unit’s \(y\). The upshot of these global effects is that they complicate inference. On the one hand, they preclude interpretation of coefficients as ordinary marginal/partial effects, or how a one-unit change in \(x\) impacts \(y\), ceteris paribus: both direct (\(x_{i} \rightarrow y_{i}\)) and indirect (\(x_{i} \rightarrow y_{j}\)) effects differ in value from the coefficients, a function of the feedback-loop dynamic,21 meaning said coefficients cannot be interpreted as simple \(x \rightarrow y\) relationships in either sense. On the other, the assumption of global dependence is not invariably appropriate: some spatial processes may not extend to the entire system (e.g., the international level, which regional clustering of democracy, wealth, and other variables generally suggests),22 or may take time to become fully realized, despite that SAR diffuses global effects exactly at time \(t\).23

  • 21 Namely, as a result of the spatial multiplier matrix.

  • 22 Indeed, Wimpy et al. find that global dependencies are “relatively rare in political science” (2021, p. 737).

  • 23 For more on the implications of SAR’s construction, including mathematical proofs thereof, see Rüttenauer (2024, pp. 11-14; 2022, pp. 731-736), and Wimpy et al. (2021, pp. 723-724).

  • Figure 1: Preprocessing Flowchart

    By contrast, the SLX model is highly flexible and easy to interpret. It begins with the more conservative assumption of local dependence (i.e., relationships between first-order neighbors), though researchers may still account for more global (i.e., higher-order) dependencies should they desire to do so. Also, by eschewing global dependence as a starting point, SLX is better suited for panel data, where spatial spillovers at time \(t\) may be confined to first- or low-order neighbors. Global effects being absent by default, SLX models need not entail feedback loops, enabling coefficients to be interpreted as ordinary, OLS-like marginal effects.24 Equally beneficial, SLX is fit for a wide range of inferential techniques, including machine-learning methods, since it merely involves introducing the spatial lags as an additional set of covariates. Finally, Rüttenauer (2022) and Wimpy et al. (2021) find that SLX is relatively robust to misspecifications of the spatially-dependent covariates.25 In view of these advantages, the authors posit that SLX is an apt starting point for model building, particularly absent strong theory implying that a different spatial model would be more appropriate (e.g., a valid belief in global contagion effects, in which case SAR would indeed be suitable).26

  • 24 This point is stressed by Rüttenauer (2022, p. 735).

  • 25 For more on their findings, see Rüttenauer (2022, p. 749) and Wimpy et al. (2021, p. 737).

  • 26 The advantages of SLX are elegantly summarized in Rüttenauer (2024, p. 23) and Wimpy et al. (2021, p. 725).

  • 27 Formula taken from Wimpy et al. (2021, p. 724).

  • SLX is of the basic form:27

    \[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{WZ}\boldsymbol{\theta} + \epsilon\]

    Above, \(\mathbf{Y}\) is the outcome, \(\epsilon\) the error term, \(\mathbf{X}\) the matrix of independent variables, \(\boldsymbol{\beta}\) the corresponding coefficients, \(\mathbf{W}\) the matrix of first-order neighbor weights, \(\mathbf{Z}\) the matrix of covariates (or confounders) expected to have a spatial influence, and \(\boldsymbol{\theta}\) the spatial coefficients,28 capturing spillover effects from \(\mathbf{Z}\) covariates on the outcome.29

  • 28 Wimpy et al. label \(\mathbf{X}\) and \(\mathbf{Z}\) differently to convey that their contents may vary (2021, p. 724). For example, \(\mathbf{X}\) may contain both the treatment(s) and covariates, whereas \(\mathbf{Z}\) may contain the latter, exclusively.

  • 29 More on interpreting the SLX formula may be found in Rüttenauer (2024, p. 8).

  • 30 That is, a two-dimensional matrix wherein each row sums to 1.

  • 31 This matrix depiction is taken from Rüttenauer (2024, p. 3).

  • \(\mathbf{W}\), an \(N \times N\) row-normalized matrix,30 is of especial import, containing the base information needed to compute the “spatial lags”—versions of each covariate from \(\mathbf{Z} \in \mathbb{R}^{N \times v}\) that take the mean of any given observation’s neighbors, and the inclusion of which ultimately makes for an SLX model. Together, \(\mathbf{WZ}\) denote the spatial lags. \(\mathbf{W}\) is of the form:31

    \[\mathbf{W} = \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} & \ldots & w_{1, n} \\ w_{2,1} & w_{2,2} & w_{2,3} & \ldots & w_{2, n} \\ w_{3,1} & w_{3,2} & w_{3,3} & \ldots & w_{3, n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{n,1} & w_{n,2} & w_{n,3} & \ldots & w_{n, n} \end{bmatrix}\]

    Here, each \(w_{i,j}\) element of \(\mathbf{W}\) gives the proportion of any given unit’s neighbors one must consider in computing the spatial lags. As a simple yet relevant example, consider a country-year dataset comprising three countries—\(i\), \(j\), and \(k\)—and a single covariate, such that \(\mathbf{Z} = z_{v}\). For a select year \(t\), let \(i\) and \(j\) be neighbors, and let \(j\) and \(k\) be neighbors. The row-normalized neighbor-weight matrix would thus be:

    \[\mathbf{W} = \left[ \begin{array}{c|ccccc} & i & j & k \\ \hline i & 0 & 1 & 0 \\ j & 0.5 & 0 & 0.5 \\ k & 0 & 1 & 0 \end{array} \right]\]

    Furthermore, let \(z_{v} = (i = 2, \; j = 4, \; k = 7.5)\). The spatial lags at \(t\) for countries \(i\), \(j\), and \(k\) on covariate \(z_{v}\) are thus found by:

    \[\mathbf{W}z_{v}^{\; T} = \begin{bmatrix} 0 & 1 & 0 \\ 0.5 & 0 & 0.5 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ 4 \\ 7.5 \end{bmatrix} = \begin{bmatrix} 4 \\ 4.75 \\ 4 \end{bmatrix}\]

    As aforementioned, researchers leveraging SLX may opt to model spillovers beyond first-order relationships. In this scenario, one need only derive higher-order forms of \(\mathbf{W}\) (e.g., \(\mathbf{W}^{2}\) for second-order neighbors) and compute the additional spatial lags, allowing for the expanded SLX model:

    \[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{WZ}\boldsymbol{\theta} + \mathbf{W}^{2}\mathbf{Z}\boldsymbol{\theta}_{2} + ... + \mathbf{W}^{n}\mathbf{Z}\boldsymbol{\theta}_{n} + \epsilon\]

    Though these higher-order lags are easy to generate, they may entail high computation and storage costs at the modeling stage. Specifically, introducing more spatial lags increases the number of dimensions for each dataset, hampering fit speed and enlarging file sizes, whilst suggesting the implementation of individual models for each maximum order, \(n\). Moreover, as regards my research, scant evidence or theory exists on which to justifiably set \(n\) beyond \(n = 1\). Altogether, these considerations motivate me to restrict my spatial lags to the first order, though doing so does not foreclose the possibility of implementing higher-order lags as robustness checks.

    To compute the spatial lags, \(\mathbf{WZ}\), I first obtain the spatial data needed to derive the weight matrix, \(\mathbf{W}\). This data is sourced from Natural Earth, a public domain map dataset containing polygons for all of the world’s countries and land objects, and accessed through rnaturalearth::ne_countries. From these polygons, and for every year \(t\), I use sf::st_distance to generate an \(N \times N\) matrix, wherein each element gives the minimum great-circle distance separating the borders of any two countries. The resultant matrix is structured such that—for countries \(i\) and \(j\) separated by a minimum distance \(d\):32

  • 32 By default, st_distance gives \(d\) in meters.

    1. \(d_{i,i} = 0\) and \(d_{j,j} = 0\),
    2. \(d_{i,j} = 0\) and \(d_{j,i} = 0\) when \(i\) and \(j\) share contiguous borders, and
    3. \(d_{i,j} > 0\) and \(d_{j,i} > 0\) when \(i\) and \(j\) do not share contiguous borders.

    Subsequently, I specify neighbors for the annual country-distance matrices under a binary neighborhood-membership criterion.33 Of course, one could define neighbors as a country-pair sharing contiguous borders under a “rook” rule; but doing so is to presume that many countries lacking shared borders (e.g., island states) do not belong to any geographical “neighborhood” and are thus immune to spatial processes. This assumption seems unrealistic, so I adopt the distance-threshold rule of 200 kilometers suggested by Ward & Gleditsch (2007, p. 12) to decide neighborhood membership. Under this criterion, with respect to country \(i\), country \(j\) qualifies for membership in \(i\)’s neighborhood if \(d_{i,j} \leq 200 \text{km}\). Every neighbor relationship being specified, I derive \(\mathbf{W}\) using spdep::nb2listw. Finally, with both \(\mathbf{W}\) and \(\mathbf{Z}\)—the common covariates enumerated in Section 2.3.1, as well as any chapter-specific covariates—I compute the spatial lags, relying on spdep::lag.listw to do so.

  • 33 Put differently, under the criterion, country \(i\) is classified with respect to country \(j\) as either a neighbor (1) or a non-neighbor (0). Ergo, no “partial” neighbor exists.

  • 34 For example, for country \(i\) with neighbors \(j\), \(k\), and \(l\), and with respect to variable \(z\), let \(z_{j}\) be missing but \(z_{k}\) and \(z_{l}\) be non-missing. Using exclusively known data, \(\bar{z} = 0.5(z_{k} + z_{l})\). Should \(z_{j}\) be imputed, however, this value may be introduced to the equation, in which case \(\bar{z} = 0.33(z_{j} + z_{k} + z_{l})\). The second \(\bar{z}\) is likely to resemble the first, provided that \(z_{j}\) is not an outlier, since the underlying equations differ only by the presence of \(z_{j}\) (as well as corresponding adjustments to the averaging weight).

  • In calculating \(\mathbf{WZ}\), lag.listw produces NA values for lags where at least one neighbor’s information is missing. These values can be handled in one of two ways: by imputing them, or by first imputing missingness in \(\mathbf{Z}\)—thus forming \((\mathbf{m}_{i})_{i=1}^5\)—and subsequently computing \(\mathbf{WZ}\) for each \(\mathbf{m}_{i}\). The first option, relying solely on the multiple-imputation algorithm, generates values without enforcing the spatial constraints imposed by \(\mathbf{WZ}\), where each element must be an average of the values of country \(i\)’s neighbors. Consequently, the imputed figures are unlikely to approximate the mean of country \(i\)’s non-missing neighbors—a reasonable starting point for estimating the true value of the lags—and they may vary significantly by \(\mathbf{m}_{i}\). The second option, by contrast, respects \(\mathbf{WZ}\)’s spatial constraints, computing the lags more-or-less normally—namely, by directly considering known neighbor values, supplementing these inputs with imputations where necessary. As a result, these estimates hew closer to the mean of country \(i\)’s non-missing neighbors and are therefore more realistic.34 To wit, the second method yields more consistent, accurate, and theoretically-valid estimates of the lags across each \(\mathbf{m}_{i}\). It is clearly preferable, so for each dataset \(\mathbf{S}\), I proceed by finding \(\mathbf{WZ}\) from imputed datasets \((\mathbf{m}_{i})_{i=1}^5\). These lags are then appended to each \(\mathbf{m}_{i}\), making five expanded imputed datasets, \((\mathbf{p}_{i})_{i=1}^5\), where \(\mathbf{p}_{i} = [\mathbf{Y \; D \; X \; WZ}] \in \mathbb{R}^{N \times (1 + q + k + v)}\)

    2.3.4 Start-Year Specification

    Upon completing the spatial lags, I specify the “start years”—the year at which at each \(\mathbf{p}_{i}\) may begin—where necessary. The exact justifications for these start years may be found in Section 3, Section 4, and Section 5; but at this juncture, a few clarifying points merit mentioning. First, they serve one of two purposes: to align with those adopted by the authors inspiring my work, or to test alternative theories in which another start year may plausibly yield different results. Second, the maximum number of start years set for any given chapter is two: although data availability generally allowed for a third start year beginning before those used by the authors, introducing it in preliminary model testing revealed prohibitive computational costs.35 Third, Chapter 3 specifies 1990—exclusively—as the start year, since setting start years more recent than this entails inferential risks.36 Ultimately, in specifying start years, there now exist iterations \(\mathbf{t}_{i,j}\) for each \(\mathbf{p}_{i}\), where \(j\) signifies the index number of the start years, ordered chronologically.37 In chapters 1 and 2, two start years are specified, such that \(j \in \{1, 2\}\). By contrast, Chapter 3 features a single start year, so \(j = 1\).

  • 35 To be precise, running a task with three start years usually led to memory limits being reached, after which the task would fail.

  • 36 Indeed, because the sample in this chapter is already limited to “Global South” countries, and in consideration of subsequent temporal lagging, a later start year would very likely give way to sample sizes so small that they violate Rubin’s pooling rules.

  • 37 For example, for Chapter 1, with start years 1977 and 1990, \(j \in \{1, 2\}\) corresponds to \(\{1977, 1990\}\).

  • 2.3.5 Temporal Lagging

    To evaluate whether the treatment effects of interest are delayed,38 I perform my final preprocessing step: temporally lagging all independent variables—the treatments (\(\mathbf{D}\)), covariates (\(\mathbf{X}\)), and spatial lags (\(\mathbf{WZ}\))—for each \(\mathbf{t}_{i,j}\) with respect to the outcome (\(\mathbf{Y}\)). Namely, I lag the independent variables by \(t - l\), where \(t\) is a given year for any observation of \(\mathbf{Y}\), \(l\) the number of years prior to \(t\) from which values of the independent variables are taken and aligned with \(\mathbf{Y}\) at \(t\), and \(l \in \{1, ..., 8\}\).39 Doing so transforms \(\mathbf{t}_{i,j}\) to iterations \(\mathbf{v}_{i,j, l}\), the ultimate datasets passed through my models.

  • 38 That is, whether the treatments require certain time spans to elapse before their effects can be realized or detected.

  • 39 The lower bound of the lags (\(l = 1\)) is the most basic and common temporal lag employed by social scientists. The upper bound (\(l = 8\)), on the other hand, seeks to test whether the intended outcome of a policy implemented at the onset of a hypothetical, consecutive two-term U.S. presidency may eventually be realized by its very end.

  • 40 Specifically, under such conditions, when independent variables are temporally lagged with respect to \(\mathbf{Y}\), early observations take on missing values and are effectively dropped as a result. For instance, if the independent variables are lagged by \(t - 1\) (i.e., one year), their values for start year \(t_{sy}\) appear when \(\mathbf{Y}\) is at \(t_{sy} + 1\). Because there naturally exist no data at \(t_{sy} - 1\) (i.e., the year prior to \(t_{sy}\)) from which to complete the lag, all independent variables corresponding to \(\mathbf{Y}\) at \(t_{sy}\) become missing. In the country-year case, this means that for every country, when \(\mathbf{Y}\) is at \(t_{sy}\), all independent variables are missing, and \(N\) is simply reduced by the total number of countries (since one observation per country now features missingness).

  • 41 For more on \(df_{com}\) and when \(df_{com} = \infty\) is appropriate, see ?mice::pool.table().

  • Temporal lagging is typical of inferential models, especially those concerned with international policy questions and country-year panel data. Nevertheless, the technique poses a few complications for my research owing to the idiosyncrasies of my overall methodology. First, absent extant data preceding the first chosen start year, \(N\)—the sample size of \(\mathbf{v}_{i,j, l}\)—inevitably shrinks as \(l \rightarrow \infty\).40 This is a structural feature of temporal lagging and is not problematic per se. However, pooling my results—the final step of the multiple imputation process and my models writ large (see Section 2.5)—requires that \(N > 1000\) be satisfied. Indeed, pooling demands the specification of the degrees of freedom of the complete datasets, \(df_{com}\); but as the DML method (see Section 2.4) prevents the consistent location of such a value—\(N\) varies over cross fits, and the number of model parameters may likewise vary over cross fits depending on the selected learner—it is safer to assume that \(df_{com} = \infty\), which is only acceptable when \(N\) is sufficiently large.41 Fortunately, virtually every \(\mathbf{v}_{i,j, l}\) satisfies the aforementioned condition, though there are a few instances—particularly as \(l \rightarrow 8\) and in Chapter 3, where the universe of cases is intentionally small (see Section 5) and the single start year (1990) is more recent—where it does not. I will identify such instances where appropriate, and the results thereof should be interpreted with caution.

    Second is the infeasibility of implementing finite distributed lag models (FDLs). An option available to political scientists interested in spatio-temporal processes,42 FDLs attempt to capture cumulative delayed effects ending at time \(t\) but beginning at \(t - l\) by including both contemporary independent variables (\(\mathbf{X}_{t}\)) and lagged versions thereof spanning the sequence of discrete time points (e.g., years) from \(t - 1\) through \(t - l\) (\(\mathbf{X}_{t-1}, ..., \mathbf{X}_{t-l}\)). For example, in a simple FDL, where \(l = 1\):43

  • 42 A discussion of FDLs and their utility for political-science research may be found in Cook et al. (2023).

  • 43 For equation, see Cook et al. (2023, p. 64).

  • \[ y_{t} = \beta x_{t} + \gamma x_{t-1} + \epsilon_{t} \]

    Here, both \(x_{t}\) and \(x_{t-1}\) feature in the model, and the cumulative effect of \(x\) on \(y\) beginning at \(t-1\) is found simply by summing the estimates of \(\beta\) and \(\gamma\).

    Typical problems attending FDLs arise from the selection of \(l\)—an arbitrary, user-defined limit—and multicollinearity (especially as \(l \rightarrow \infty\)), since each \(\mathbf{X}\) is usually highly correlated with its temporal neighbor(s) along the lag sequence.44 A more unique complication, however, appears in the context of DML. As we shall soon see (Section 2.4.2), the DML models I deploy partly rely on predicting the treatment itself from the remaining independent variables. FDLs in this paradigm, therefore, inexorably entail utilizing later treatment observations to predict earlier ones. From our previous example, estimating effect \(\gamma\) (i.e., the impact of \(x_{t-1}\) on \(y_{t}\)) necessitates computing:

  • 44 For a brief discussion of these issues, see Mignon (2024, p. 271).

  • \[ x_{t-1} = \beta x_{t} + \epsilon_{t} \]

    Nonetheless, modeling \(x_{t-1}\) as a function of \(x_{t}\) violates an obvious and elementary rule of causality: that cause \(x\) must temporally precede effect \(y\). Because syntheses of FDLs and DML suggest such impossible event sequences—namely, effects occurring before their causes—the two methods are fundamentally incompatible. I proceed, then, by forgoing the use of FDLs, finding treatment effects at each \(t-l\) independently (i.e., for each \(\mathbf{v}_{i,j, l}\)), though doing so admittedly precludes the potential for further investigation into cumulative effects.

    2.4 Double/Debiased Machine Learning (DML)

    2.4.1 Introduction

    With preprocessing complete, I turn to executing my models, which in all chapters operate under the double/debiased machine learning (DML) framework. First introduced by Chernozhukov et al. (2018), DML is an emerging class of inferential, econometric models. It involves solving at least two problems—a “main” equation predicting the outcome of interest (\(\mathbf{Y} \in \mathbb{R}^{N \times 1}\)), and an “auxiliary” one predicting the policy variable of interest itself (\(\mathbf{D} \in \mathbb{R}^{N \times 1}\))—in order to obtain an estimate of a true treatment effect (\(\hat{\theta}_{0}\) of \(\theta_{0}\)); hence, it entails “double prediction.”45 Furthermore, its chief purpose is to mitigate the regularization bias and overfitting that ordinarily occur when machine-learning (ML) methods (e.g., ridge, lasso, random forests, boosted trees, neural nets, SVMs, ensembles, stacks, etc.) are naively employed to estimate \(\theta_{0}\), making the approach “debiased.”46

  • 45 Note that now \(\mathbf{D} \in \mathbb{R}^{N \times 1}\) rather than \(\mathbf{D} \in \mathbb{R}^{N \times q}\), a notational adjustment I maintain for the remainder of the section. I do so not only align with the DML literature—wherein \(\mathbf{D}\) is almost always discussed as a single treatment—but also to emphasize that even in the multi-treatment case, \(\mathbf{D} \in \mathbb{R}^{N \times 1}\) for any computation of \(\hat{\theta}_{0}\), since the DML procedure must be implemented on each treatment to find each \(\hat{\theta}_{0}\). This point will become clearer towards the end of Section 2.4.2, wherein I elaborate on DML in the multi-treatment case.

  • 46 For more on the fundaments of DML, see Chernozhukov et al. (2018, pp. C1-C8).

  • 47 The nuisance parameter (\(\eta_{0}\)) is denominated as such because although it is necessary for obtaining the target parameter (\(\theta_{0}\)), it is not of direct interest.

  • 48 For more on this advantage of DML, see Ahrens et al. (2025, p. 4).

  • DML is motivated in the first instance by high-dimensional cases, where \(\eta_{0}\)—the “nuisance parameter” needed to find \(\theta_{0}\)—is highly complex, usually because the covariate coefficients comprising \(\eta_{0}\) are similar to or larger in number than the total observations (\(P \geq N\)).47 Nevertheless, the framework is a general means of achieving valid estimation of \(\theta_{0}\), even from datasets where \(P < N\). First, these cases may still exhibit high-dimensionality absent a “pre-specified, known parametric model for [\(\eta_{0}\)]” (Ahrens et al., 2025, p. 3). Indeed, the “true” model for \(\eta_{0}\) may be nonparametric or depend on covariates unknown ex ante—possibilities especially germane to observational studies. Second, in both high- and low-dimension settings, DML guarantees flexibility, freeing researchers to estimate \(\eta_{0}\) (and with it, \(\theta_{0}\)) using ML methods of all kinds, including nonparametric/nonlinear ones.48 Last, researchers working with \(P < N\) data benefit from knowing they may safely increase \(P\) if desired (e.g., for robustness checks) without needing to entertain major model alterations, since DML is constructed to accommodate \(P \geq N\). Therefore, even though \(P < N\) for every \(\mathbf{v}_{i,j, l}\) at present, DML remains a valid and useful framework for estimating my treatment effects of interest.

    2.4.2 Partially-Linear Regression (PLR)

    The specific DML model I elect to use is the partially-linear regression (PLR) model. I do so because it is the most generally appropriate—additional assumptions are needed to justify the use of other DML models (e.g., those containing instrumental variables)—and is the simplest, making it the easiest to interpret.49 The model is initially of the form:

  • 49 PLR is indeed the most basic of the DML models: it is built on two equations, clearly capturing DML’s double-prediction logic, while also being reducible to an easy-to-explain OLS regression (as we shall soon see). PLR thus serves as the paradigmatic model for all introductions to DML. A full enumeration of the DML models currently available in statistical software may be found in the DoubleML User Guide.

  • \[ \begin{align*} \mathbf{Y} = \mathbf{D}\theta_{0} + g_{0}(\mathbf{X}) + \boldsymbol{\zeta}, \qquad \mathbb{E}[\boldsymbol{\zeta}|\mathbf{D},\mathbf{X}] = 0 \\ \mathbf{D} = m_{0}(\mathbf{X}) + \mathbf{V}, \qquad \mathbb{E}[\mathbf{V}|\mathbf{X}] = 0 \end{align*} \]

    As before, \(\mathbf{Y}\) is the outcome, \(\mathbf{D}\) the treatment, \(\theta_{0}\) the effect of interest, and \(\mathbf{X}\) the covariates. Additionally, \(g_{0}\) and \(m_{0}\) are the nuisance functions, where \(\eta_{0} = (m_{0}, g_{0})\), and \(\boldsymbol{\zeta}\) and \(\mathbf{V}\) are the stochastic error terms. In the main equation, a user-specified machine learner predicts \(\mathbf{Y}\) by estimating \(g_{0}\), whereas in the auxiliary equation, another chosen learner predicts \(\mathbf{D}\) by estimating \(m_{0}\).

    As aforementioned, ML methods are vulnerable to overfitting and regularization bias when used naively. On the one hand, since they are often highly complex and flexible, ML models are prone to virtual memory of training data, compromising external validity.50 On the other, the regularization parameters employed in ML models to guard against overfitting (e.g., \(\alpha\) and \(\lambda\) in the elastic-net paradigm) introduce a new type of bias by forcing said models towards simplicity (e.g., shrinking coefficients towards \(0\) in lasso and ridge), perhaps artificially so.51 To overcome these forms of bias, DML relies on two techniques: (1.) orthogonalization and (2.) cross-fitting.

  • 50 Put differently, such models may do well at prediction using the original samples on which they are trained yet perform poorly when presented with new samples. Further heuristic definitions of overfitting are given in Chernozhukov et al. (2024, pp. 71 & 255).

  • 51 The example of “shrinkage bias,” a type of regularization bias, is given in Chernozhukov et al. (2024, p 74).

  • 52 PLR is “partially linear,” then: OLS is mandatory in the final step, but nonlinear (e.g., tree-based) models are permissible to estimate \(\eta_{0}\).

  • 53 The equation here assumes the selection of the standard score function, which does not estimate \(g_{0}(\mathbf{X})\) directly, but instead estimates \(\ell_{0}(\mathbf{X}) := \mathbb{E}[\mathbf{Y}|\mathbf{X}] = \mathbb{E}[\mathbf{D}|\mathbf{X}]\theta_{0} + g_{0}(\mathbf{X})\) (see the DoubleML User Guide). This explains why \(\ell_{0}\) now replaces \(g_{0}\). Below, I will elaborate on the role of score functions in DML.

  • The first—orthogonalization—serves to counter regularization bias. In the PLR case, this occurs through the “partialling-out” procedure, wherein the residuals of the main and auxiliary equations are extracted, and the former is regressed on the latter using ordinary least squares (OLS), yielding not only the regression coefficient \(\hat{\theta}_{0}\), but also the accompanying standard error and confidence interval.52 Specifically, consider the original PLR equation transmuted to residualized form:53

    \[ \begin{align*} \mathbf{Q} = \mathbf{V}\theta_{0} + \mathbf{\zeta}, \qquad \mathbb{E}[\mathbf{\zeta}|\mathbf{D},\mathbf{X}] = 0 \\ \mathbf{Q} = \mathbf{Y} - \ell_{0}(\mathbf{X}), \qquad \ell_{0}(X) = \mathbb{E}[\mathbf{Y}|\mathbf{X}] \\ \mathbf{V} = \mathbf{D} - m_{0}(\mathbf{X}), \qquad m_{0}(X) = \mathbb{E}[\mathbf{D}|\mathbf{X}] \end{align*} \]

    Here, the residuals \(\mathbf{Q}\) and \(\mathbf{V}\) correspond to \(\mathbf{Y}\) and \(\mathbf{D}\), respectively, after partialling out (i.e., subtracting) the effect of \(\mathbf{X}\) on each. In practice, DML estimates these residuals using the plug-in estimators of the nuisance functions, \(\hat{\ell}_{0}\) of \(\ell_{0}\) and \(\hat{m}_{0}\) of \(m_{0}\), initially found by the learners predicting \(\mathbf{Y}\) and \(\mathbf{D}\):

    \[ \begin{align*} \mathbf{\hat{Q}} = \mathbf{Y} - \hat{\ell}_{0}(\mathbf{X}) \\ \mathbf{\hat{V}} = \mathbf{D} - \hat{m}_{0}(\mathbf{X}) \end{align*} \]

    \(\hat{\theta}_{0}\) is then obtained simply by regressing \(\mathbf{\hat{Q}}\) on \(\mathbf{\hat{V}}\).54

  • 54 This explanation of PLR is adapted mostly from Bach et al. (2024, pp. 7 & 10-12). For an additional simple discussion of PLR, see Chernozhukov et al. (2024, pp. 249-251).

  • 55 The term “sample analogue” is also referred to in the DML literature as the “empirical analogue.” I use the former term, however, to preserve lexical consistency with earlier discussions involving samples \(\mathbf{v}_{i,j,l}\).

  • 56 See Chernozhukov et al. (2024, pp. 24-25) for more on the FWL Theorem.

  • 57 This condition may be found in Bach et al. (2024, p. 12).

  • The partialling-out operation results in a debiased \(\hat{\theta}_{0}\) for two reasons. First, the Frisch–Waugh–Lovell (FWL) Theorem guarantees that it actually gives \(\hat{\theta}_{0}\). Namely, since FWL demonstrate that the population regression coefficient, \(\theta_{0}\), may be recovered by regressing the true residualized \(\mathbf{Y}\) on the true residualized \(\mathbf{D}\)—that is, \(\mathbf{Q}\) on \(\mathbf{V}\)—it follows that in the sample analogue,55 regressing \(\mathbf{\hat{Q}}\) on \(\mathbf{\hat{V}}\) returns \(\hat{\theta}_{0}\).56 Second, in obeying Neyman orthogonality, partialling out alleviates regularization bias affecting \(\hat{\theta}_{0}\). All DML models estimate \(\theta_{0}\) using score functions \(\psi(\mathbf{S};\theta,\eta)\) that solve the sample analogue of the moment condition:57

    \[ \begin{align*} \mathbb{E}[\psi(\mathbf{S};\theta_{0},\eta_{0})] = 0 \end{align*} \]

    Specifically:58

  • 58 The generic DML estimator is expressed as such in Chernozhukov et al. (2018, p. C8).

  • \[ \begin{align*} \frac{1}{n}\sum_{i \in I}\psi(\mathbf{S};\hat{\theta}_{0},\hat{\eta}_{0}) = 0 \end{align*} \]

    Here, \(\mathbf{S} := (\mathbf{Y}, \mathbf{D}, \mathbf{X})\).59 The function \(\psi\) also obeys the Neyman-orthogonality condition:60

  • 59 The DML literature typically uses \(\mathbf{W}\) to denote the observations, but I use \(\mathbf{S}\) to maintain alignment with the notation seen in Section 2.3.1.

  • 60 The condition is expressed as such in Chernozhukov et al. (2018, p. C8). In Bach et al. (2024, p. 12), it is writtenn similarly as: \(\partial_{\eta}\mathbb{E}[\psi(\mathbf{S};\theta_{0},\eta_{0})]\bigg|^{}_{\eta = \eta_{0}} = 0\)

  • \[ \begin{align*} \partial_{\eta}\mathbb{E}[\psi(\mathbf{S};\theta_{0},\eta_{0})][\eta - \eta_{0}] = 0 \end{align*} \]

    Heuristically, this property guarantees that the moment-condition estimator is insensitive to minor deviations of the plug-in nuisance parameter \(\hat{\eta}_{0}\) about its true value \(\eta_{0}\).61 It follows that DML ensures a \(\hat{\theta}_{0}\) robust to regularization bias, a common source of such perturbations in ML. In the case of PLR, partialling out indeed implies a Neyman-orthogonal score function. Under the standard approach:62

  • 61 A more formal definition of the property is that \(\partial_{\eta}\), “the Gateaux derivative operator with respect to \(\eta\)[,] vanishes when evaluated at the true parameter values” (Chernozhukov et al., 2018, p. C8). Additional heuristic explanations of Neyman orthogonality may be found in Bach et al. (2024, p. 12), Chernozhukov et al. (2018, pp. C8-C9), and Chernozhukov et al. (2024, pp. 107-108), for example.

  • 62 The function presented here is “standard” insofar as it is the default for PLR computations in statistical software. For more on this score function’s alternative—one that is typically reserved for instrumental-variable models—see the DoubleML User Guide. For a proof of its Neyman orthogonality, see Chernozhukov et al. (2024, p. 283).

  • \[ \begin{align*} \psi(\mathbf{S};\theta,\eta) := [\mathbf{Y} - \ell(\mathbf{X}) - \theta(\mathbf{D} - m(\mathbf{X}))][\mathbf{D} - m(\mathbf{X})]\ \end{align*} \]

    Accordingly, PLR provides the same assurance as that on offer from DML models generally: debiased estimates of \(\theta_{0}\).63

  • 63 That the partialling-out operation (i.e., regressing residuals \(\mathbf{\hat{Q}}\) on \(\mathbf{\hat{V}}\) to obtain \(\hat{\theta}\)) implies the foregoing score function may not be immediately apparent to the reader. To obviate any misunderstandings, therefore, I offer a simple proof in Section 8.5 demonstrating how the OLS solution to \(\hat{\theta}_{0}\) may be recovered from it.

  • 64 For more on why cross-fitting helps address overfitting, see Chernozhukov et al. (2024, p. 273).

  • 65 If \(\mathbf{S}\) cannot be partitioned evenly into \(K\) splits, then \(n \approx N/K\).

  • The second bias-tackling technique—cross-fitting—handles the problem of overfitting by preventing data leakage, thereby ensuring that the observations used to obtain \(\hat{\eta}_{0}\) is independent of those used subsequently to find \(\hat{\theta}_{0}\).64 In generic terms, the procedure begins by randomly splitting sample \(\mathbf{S}\) into \(K\) folds of observation indices \([N] = \{1,...,N\}\) and of size \(n = N/K\), returning index sets \((I_{k})_{k = 1}^{K}\) as well as their complements, \(I_{k}^{c} = [N] \backslash I_{k}\).65 Next, the learners estimating \(\eta_{0}\) (e.g., those finding \(\hat{\ell}_{0}\) and \(\hat{m}_{0}\), respectively) are fit on training data \(I_{k}^{c}\) to return:

    \[ \begin{align*} \hat{\eta}_{0,k} = \hat{\eta}_{0,k}(\mathbf{S}_{I_{k}^{c}}) \end{align*} \]

    This is performed for each \(k \in [K] = \{1,...,K\}\). Finally, \(\hat{\theta}_{0}\) is obtained from a user-specified algorithm aggregating evaluations on test data \(I_{k}\) that themselves use \(\hat{\eta}_{0,k}\). Optionally, the entire operation may be repeated \(R\) times, where \(r \in [R] = \{1,...,R\}\), in which case the median of parameters \(\boldsymbol{\hat{\theta}_{0}}\) is selected as the final causal estimate.66

  • 66 This explanation is adapted from Ahrens et al. (2025, p. 19), Bach et al. (2024, pp. 16-17) and Chernozhukov et al. (2018, p. C23). A useful visualization of cross-fitting may also be seen in the DoubleML YouTube channel.

  • 67 This algorithm is labeled “DML2” in both the literature and software. The alternative, DML1, obtains \(\hat{\theta}_{0}\) by averaging \(\check{\theta}_{0,k}\), preliminary estimates of \(\hat{\theta}_{0}\) solved at each \(k\) fold. DML2 is generally preferred over its counterpart because it tends to behave more stably. For more on the two algorithms, including the recommendation of DML2, see Bach et al. (2024, pp. 16-17) and Chernozhukov et al. (2018, p. C24).

  • As aforementioned, the user sets the algorithm aggregating the test-data evaluations. The recommended and software-default algorithm, which I opt to follow, is of the form:67

    \[ \begin{align*} \frac{1}{N}\sum_{k = 1}^{K}\sum_{i \in I_{k}}\psi(\mathbf{S};\hat{\theta}_{0},\hat{\eta}_{0,k}) = 0 \end{align*} \]

    Here, each \(\hat{\eta}_{0,k}\) is plugged into the moment-condition estimator simultaneously, such that \(\hat{\theta}_{0}\) may be found from a single equation conducting all \(N\) evaluations.68 In the PLR case, this entails:

  • 68 Put differently, unlike DML1, preliminary estimates \(\check{\theta}_{0,k}\) of \(\hat{\theta}_{0}\) need not be found first.

    1. Extracting residuals \(\mathbf{\hat{Q}}_{k}\) and \(\mathbf{\hat{V}}_{k}\), themselves found by predicting on observations \(I_{k}\) with their corresponding \(\hat{\eta}_{0,k}\),
    2. Concatenating the residuals such that \(\mathbf{\hat{Q}} = (\mathbf{\hat{Q}}_{k = 1}, ..., \mathbf{\hat{Q}}_{K})\) and \(\mathbf{\hat{V}} = (\mathbf{\hat{V}}_{k = 1}, ..., \mathbf{\hat{V}}_{K})\),
    3. Regressing \(\mathbf{\hat{Q}}\) on \(\mathbf{\hat{V}}\) with OLS, thus yielding \(\hat{\theta}_{0}\) (as well as the standard error and confidence interval thereof).69
  • 69 A heuristic description of this procedure is found in Chernozhukov et al. (2024, p. 251).

  • 70 For more on the general appropriateness of \(K = 5\), see Chernozhukov et al. (2024, p. 276).

  • 71 To be precise, as \(R \rightarrow \infty\), one can be surer that the inference about the chosen \(\hat{\theta}_{0}\) is not an outlier, as per the Central Limit Theorem.

  • Further set by the user are the numbers of \(K\) sample partitions and \(R\) cross-fitting rounds. As \(K \approx 5\) is usually sufficient for medium-sized datasets, the software-default number of folds is \(K = 5\).70 This setting is indeed a suitable starting point for my work, since each \(\mathbf{v}_{i,j, l}\) is approximately medium-sized: \(N\) is always more than a few thousand but never on the order of tens or hundreds of thousands. More important to my selection of \(K\), however, are computational and time costs, which invariably increase as \(K \rightarrow \infty\). In fact, in the context of two-way cluster-robust fold splitting—a technique I subsequently implement (see Section 2.4.6)—these expenses balloon exponentially at a rate of \(K^{2}\). With \(K = 5\), virtually all of my models feature runtimes of several hours (see Section 8.6) and the high computational costs these normally imply. In the interest of easing model replication, I hew to this baseline throughout, refraining from any increases to \(K\). Similarly, I select \(R\) mainly in consideration of cost. Although the software default is \(R = 1\), I prefer to set \(R > 1\) so as to enhance confidence of the inference about \(\hat{\theta}_{0}\).71 However, increasing \(R\) also increases expenses by \(R(K)\) (for regular splitting) or \(R(K^{2})\) (for two-way cluster-robust splitting). The smallest \(R > 1\) number allowing for the ordinary median of \(\boldsymbol{\hat{\theta}_{0}}\) to be drawn is \(R = 3\); yet even at this setting (and with \(K = 5\)), runtimes are regularly several-hours long (see again Section 8.6). Therefore, to access the benefits of repeated cross-fitting whilst minimizing costs, I keep \(R = 3\) for every chapter.

    A final remark on PLR is that it is, like all other DML models, compatible with the multi-treatment case. Given a \(p1\) number of treatments \(\mathbf{D}_{1},...,\mathbf{D}_{p1}\) with estimated effects \(\hat{\theta}_{0,1},...,\hat{\theta}_{0,p1}\), DML finds each \(\hat{\theta}_{0}\) iteratively in typical fashion, the only departure being that for any \(\mathbf{D}\), all remaining treatments are added to covariates \(\mathbf{X}\). In the PLR case, this amounts to regressing \(\mathbf{\hat{Q}}\) on \(\mathbf{\hat{V}}\) for every \(\mathbf{D}\), where \(\mathbf{X}\) contains all treatments not \(\mathbf{D}\).72

  • 72 For more on DML and PLR with multiple treatments, see Bach et al. (2024, pp. 20-21) and the DoubleML User Guide. Support in DoubleML for simultaneous inference based on multiplier-bootstrapped test statistics is currently unavailable for clustered data (see the following DoubleML for R GitHub Discussion), hence why such inference does not appear in my robustness checks.

  • 73 Specifically, for \(\mathbf{v}_{i,j, l}\), \(p1 + P > 100\) or \(p1 + P > 200\) is typical. Of course, this explosion of costs is especially encumbering in the high-dimensional case (\(P > N\)), for which DML is originally designed.

  • Though simple, the multi-treatment procedure merits mentioning for a few reasons. First—and most obviously—it is implemented wherever I appraise multiple treatment effects simultaneously. Second, it underscores a key technical constraint DML imposes, one that limits me to obtaining and reporting effects on the treatments exclusively. Because every \(\hat{\theta}_{0}\) must be found individually in DML—indeed, the moment-condition estimator permits only one \(\hat{\theta}_{0}\) to be obtained at a time—costs in the multi-treatment case rise at a rate of \(R(K)p1\) or \(R(K^{2})p1\). It follows that if one were to estimate the effects of not only every \(\mathbf{D}\), but also all \(P\) covariates of \(\mathbf{X}\), costs would rise by \(R(K)(p1 + P)\) or \(R(K^{2})(p1 + P)\). So it is that expenses would explode estimating every effect of every dimension for datasets where \(p1 + P\) is large, such as \(\mathbf{v}_{i,j, l}\).73 The upshot of any such undertaking is that costs may prove not only hard to justify—runtimes may mushroom to the order of days—but also prohibitive, inasmuch as models may reach memory limits, and thus fail, on even the most high-performing personal computers. Thus, for my project, where the great size of \(P\) effectively guarantees such obstacles, finding \(\hat{\theta}_{0}\) for every dimension is not merely impractical, but virtually impossible. My choice of DML, then, admittedly implies a major tradeoff: quantity of \(\hat{\theta}_{0}\) for quality of \(\hat{\theta}_{0}\)—that is, effect estimates of all independent variables for debiased estimates of the treatment effects. This is a compromise with which I am comfortable—the treatment effects are of principal interest to my research agenda, after all—but it goes without saying that the inability to produce estimates of all covariates’ effects likely complicates future efforts to elucidate causal pathways.

    2.4.3 Learner Selection

    2.4.3.1 Considerations

    A cornerstone of DML, as suggested by its very name, is the use of machine learners estimating \(\eta_{0}\). These learners are always chosen by the practitioner, largely after contemplating three considerations: (1.) appropriateness, (2.) cost, and (3.) performance.

    Central to any appraisal of the first consideration—appropriateness—is an understanding of one’s data. First and foremost, users must know the variable types of the outcomes they specify for the main and auxiliary equations. If the outcome is continuous, they should select a regression learner; but if the outcome is binary, they should opt for a classification learner. Users also benefit from attention to multicollinearity among the predictors, a possibility especially likely in large-\(P\) settings. Indeed, some learners are better-equipped to handle highly-correlated predictors than others, such that the former are preferred wherever multicollinearity obtains.

    The consideration of cost, on the other hand, is evaluated chiefly with respect to two factors: hardware constraints and user tolerance. As aforementioned, some models may exhaust computer memory, prompting model failure. In many cases, this occurs due to the complexity of the learner itself—particularly at the tuning stage (see Section 2.4.5), where an appreciable number of complex fits being performed and stored simultaneously can quickly occupy total memory. Therefore, complex learners may prove strictly prohibitive, in which case simpler learners may be not just desirable, but necessary. Further affected by hardware are runtimes, which decrease on higher-performing computers (e.g., those with many cores), but which may still prove long for complex learners. Here, users must weigh the runtimes they are willing to tolerate, whether they be several-hours long or even days long, especially when accounting for any replications they or others may seek to execute.

    Performance—the final consideration—is perhaps the most amenable to rigorous assessment. Many of the most common machine learners (e.g., regularized regressions, tree-based methods, neural networks, etc.) plausibly satisfy the DML-optimal convergence requirement of \(\hat{\eta}_{0}\) to \(\eta_{0}\):74

  • 74 For more on the convergence requirement and the learners that satisfy it, see Ahrens et al. (2025, pp. 17-18), Bach et al. (2024, pp. 14-15), and Chernozhukov et al. (2018, pp. C25-C26; 2024, pp. 272-273), for example.

  • \[ \begin{align*} N^{1/4}\parallel \hat{\eta}_{0} - \eta_{0} \parallel_{L^{2}} \; \approx 0 \end{align*} \]

    Such learners immediately qualify as possible candidates for practitioners. Yet even among these, results may vary greatly for the same prediction task (e.g., in the PLR case, predictions of \(\mathbf{Y}\) with \(\hat{\ell}_{0}\) or \(\mathbf{D}\) with \(\hat{m}_{0}\)), in which case the reliability of not only \(\hat{\theta}_{0}\), but also the inference about it, may be jeopardized. Fortunately, users may buttress confidence around \(\hat{\theta}_{0}\) simply by locating the best-performing learner from among select candidates—that is, the learner most likely vis-à-vis its peers to minimize \(\parallel \hat{\eta}_{0} - \eta_{0} \parallel_{L^{2}}\) and thus the upper bound on its bias.75 Doing so is easily achieved by comparing the cross-fitted performance metrics for each prediction task, from which the best learner may be identified and causal claims made.76 Typical (and usually-sufficient) performance metrics for regression tasks are out-of-sample mean squared error (MSE) and root mean square error (RMSE).77 For classification tasks, on the other hand, users benefit from weighing a wider array of metrics minding not only their research goals (e.g., equally-accurate prediction of both positive and negative cases), but also the structure of their outcome (e.g., balance between positive and negative cases).

  • 75 In theory and in general, higher-performing learners are likelier to minimize \(\parallel \hat{\eta}_{0} - \eta_{0} \parallel_{L^{2}}\) because their plug-in parameter \(\hat{\eta}_{0}\) is likelier to approximate \(\eta_{0}\).

  • 76 Broadly speaking, cross-fitted performance metrics capture the difference between test-set predictions and their actual values. The processes and import of learner selection based on out-of-sample predictive performance is described in Ahrens et al. (2025, pp. 30-35), Chernozhukov et al. (2024, pp. 252-253), and the DoubleML Choice of Learners Guide.

  • 77 Needless to say, for the purpose of comparison (i.e., to determine whether a value is relatively large, small, or similar), the two metrics essentially convey the same information, insofar as the latter is simply the square root of the former. They are the default performance metrics for DoubleML and mlr3. For the former, see the DoubleML Choice of Learners Guide. For the latter, see ?mlr3::default_measures().

  • 2.4.3.2 Decisions

    Guided by considerations of appropriateness, cost, and performance, I select the ridge method as my baseline learner.78 First, since my treatments (i.e., auxiliary-equation outcomes) are either continuous or binary,79 ridge—as with most qualifying machine learners—is minimally appropriate, being compatible with both regression and classification tasks.80 Additionally—and more vitally—ridge is suitable for multicollinear environments, such as datasets \(\mathbf{v}_{i,j, l}\), which indeed feature highly-correlated predictors (e.g., the V-Dem indicators, covariates \(\mathbf{X}\) and spatial lags \(\mathbf{WZ}\), etc.). Between ridge and lasso, the cornerstone regularized-regression learners, it is well-known that the former outperforms the latter—oftentimes greatly so—when multicollinearity prevails.81 Accounting for this disparity are their distinct approaches to coefficient estimation: selection versus shrinkage. Empowered to minimize predictor coefficients all the way to \(0\), lasso effectively performs feature selection. Nonetheless, when presented with both highly-correlated predictors and changing model specifications (e.g., new training data during cross-fitting), lasso’s choices tend to vary, delivering erratic estimates. Ridge, meanwhile, may shrink the coefficients of correlated features towards but never at \(0\), such that all features contribute to their prediction task in some form. Also, ridge tends to shrink correlated predictors towards each other, meaning their joint influence on the outcome is usually similar fit-to-fit.82 In tandem, these qualities enable ridge to offer improved predictive stability over lasso, thereby promising smaller \(\parallel \hat{\eta}_{0} - \eta_{0} \parallel_{L^{2}}\) for my DML models.

  • 78 As they are well-established and exceedingly commonplace in econometric, machine-learning, and other research and applied settings, the candidate learners considered herein will not be delineated to any significant degree. That is, cursory research should normally suffice to locate the ad rem algorithms and procedures. As to ridge, lasso, and elastic net, however, the author submits Hastie et al. (2017, pp. 61-73) and the glmnet Vignette as useful theoretical and practical primers.

  • 79 Recall that HR Scores, a continuous variable, is the outcome of interest for each chapter. This means that every main equation features a continuous outcome.

  • 80 Ridge, specifically, may execute either task by setting the algorithm as linear for the former or logistic for the latter.

  • 81 Ridge’s superiority over lasso in the multicollinear setting is especially true when \(N > P\) (e.g., for datasets such as \(\mathbf{v}_{i,j, l}\)). For more on the limitations of lasso and advantages of ridge vis-à-vis multicollinearity, see Hastie et al. (2015, p. 56) and Zou & Hastie (2005, p. 302).

  • 82 For more on ridge’s tendencies given multicollinearity, see the glmnet Vignette.

  • 83 Specifically, elastic-net models usually tune for both \(\alpha\) and \(\lambda\), whereas ridge (\(\alpha = 0\)) and lasso (\(\alpha = 1\)) models may only tune \(\lambda\). Tuning grids are discussed in greater detail in Section 2.4.5.

  • 84 To be precise, low-resolution tuning grids entail a smaller number of tuning fits and ipso facto shorter runtimes.

  • 85 Tree-based learners, which randomly select features to inform their decision rules, virtually guarantee that only subsets of highly-correlated predictors contribute to any given tree, hence why their predictions are robust to multicollinearity. For more on this strength of tree-based learners, see Kuhn & Johnson (2019), for instance.

  • 86 A common solution to the problem of capturing possible combinations of several hyperaparameters is to randomly select such combinations a user-specified number of times (e.g., via mlr3tuning::tnr("random_search").) However, this approach is riskier

  • As to cost, regularized-regression learners such as ridge are relatively inexpensive. First, they enjoy fast runtimes and occupy little memory, since each estimate need only rely on a single solution to a least-squares optimization problem. Second, they are easy to tune. Indeed, the maximum number of hyperparameters that any of these learners may tune is \(2\)—though \(1\) is standard for ridge—such that one may reasonably capture the spectra of possible hyperparameter values (and their combinations, where applicable) with lower-resolution tuning grids.83 This, in turn, means regularized-regression learners promise quick runtimes and lower-memory usage still further.84 By contrast, tree-based learners (e.g., random forests and boosted trees), which handle multicollinearity at least as well as ridge,85 are far more costly. On the one hand, to produce a single estimate, tree-based learners must aggregate the results of hundreds or thousands of fitted trees, a procedure necessarily entailing larger time and memory expenses. On the other, a higher number of hyperparameters can and ought to be tuned to optimize performance; for example, following recommendations set forth in Bischl et al. (2023), mlr3 gives default tuning spaces of \(4\) and \(8\) hyperparameters, respectively, for random forests and boosted trees. Higher-resolution tuning grids are thus essential, compounding already-great time and memory requirements.86 Ultimately, these costs proved prohibitively expensive during my preliminary testing of tree-based methods—model failure due to memory limits being reached on my personal computer was unavoidable, even at the tuning stage—rendering a learner such as ridge easily preferable absent supercomputing access.

    Finally, ridge is likely high-performing in estimating \(\eta_{0}\). We have already seen how—in general—ridge is known to perform well amid multicollinearity, meaning we may justifiably expect it to prove high-performing given datasets \(\mathbf{v}_{i,j, l}\). And indeed, during tuning, the best ridge models were shown to perform approximately as well as the best tree-based models.87 However, as we have also seen, more complex learners demand currently-prohibitive costs, complicating efforts at comparison. A complete accounting of ridge’s performance thus awaits robustness checks leveraging supercomputing access, my plans for which I shall discuss in Section 6.

  • 87 I have yet to compile performance metrics from the tree-based learners due to aforementioned model failures, but I can do so if requested.

  • 2.4.4 Software & Hardware

    The DML method is readily executable with DoubleML, a package available in both Python and R via respective dependencies scikit-learn and mlr3. Though I experimented with each version, I settled on R’s by reason of user-friendliness, performance, and speed. First, offering the more streamlined experience was mlr3, whose dependencies automatically handle a variety of necessary and performance-enhancing preprocessing, pre-tuning, and post-fit tasks.88 Second—and likely relatedly—mlr3 performed far better than scikit-learn in preliminary tests, despite my efforts to manually replicate the former’s workflow using the latter.89 Finally, the runtimes in Python and R were effectively the same, meaning continued (and time-costly) efforts to improve the performance of the Python models stood to yield no patent benefits.90 DML in R, then, emerged as the more sensible choice for my work, guaranteeing ease of use and performance safety without sacrificing much—if any—speed.

  • 88 For example, the glmnet package, which mlr3 uses to execute regularized regressions, automatically performs covariate and dependent-variable standardization, \(\lambda\)-sequence generation, and coefficient reverse standardization. For more on these built-in capacities, see the Introduction to glmnet page.

  • 89 Specifically, during tuning for the main DML equation in Chapter 3, I observed that the best MSEs were closer to \(0.5\) for mlr3 but \(7\) for scikit-learn—a sizable difference considering the modest spread of HR Scores. I have not saved the latter results, but I can supply them if requested, since I did retain the scikit-learn scripts.

  • 90 Again, I did not save the former runtimes directly, but I can reproduce them if necessary due to my keeping the original scripts.

  • As we have seen, a major drawback of the DML technique DoubleML is also compatible with parallel processing, in R with future. software and hardware.

    2.4.5 Hyperparameter Tuning

    Figure 2: DML Tuning Flowchart

    2.4.6 Final Fits

    Figure 3: DML Flowchart

    2.5 Pooling

    Figure 4: Pooling Flowchart

    3 Chapter 1: PTAs

    3.1 Treatments

    3.2 Covariates

    3.3 Replication: Hafner-Burton (2005)

    4 Chapter 2: BITs

    4.1 Treatments

    4.2 Covariates

    4.3 Replication: Bodea & Ye (2020)

    5 Chapter 3: Economic Sanctions

    5.1 Treatments

    5.2 Covariates

    5.3 Replication: Peez (2025)

    6 Conclusion

    7 Bibliography

    Ahrens, A., Chernozhukov, V., Hansen, C., Kozbur, D., Schaffer, M., & Wiemann, T. (2025). An Introduction to Double/Debiased Machine Learning (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2504.08324

    Bach, P., Kurz, M. S., Chernozhukov, V., Spindler, M., & Klaassen, S. (2024). DoubleML: An Object-Oriented Implementation of Double Machine Learning in R. Journal of Statistical Software, 108(3). https://doi.org/10.18637/jss.v108.i03

    Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097

    Chernozhukov, V., Hansen, C., Kallus, N., Spindler, M., & Syrgkanis, V. (2024). Applied Causal Inference Powered by ML and AI (No. arXiv:2403.02467). arXiv. https://doi.org/10.48550/arXiv.2403.02467

    Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Taylor & Francis Group.

    Hastie, T., Friedman, J., & Tisbshirani, R. (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (2nd ed.). Springer. https://doi.org/10.1007/978-0-387-84858-7

    Zou, H., & Hastie, T. (2005). Regularization and Variable Selection Via the Elastic Net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x

    8 Appendices

    8.1 Appendix A: Excluded Countries

    Figure 5: Table of Countries Excluded from Models

    8.2 Appendix B: Countries Coded Manually

    Figure 6: Table of Manually-Coded Country IDs

    8.3 Appendix C: Country Totals

    Figure 7: Table of Total Country Counts per Model

    8.4 Appendix D: Missingness Summaries

    Figure 8: Table Summarizing Missingness

    8.5 Appendix E: PLR Score Function to OLS

    Recall that the generic DML estimator solves the moment condition:

    \[ \begin{align*} \frac{1}{n}\sum_{i \in I}\psi(\mathbf{S};\hat{\theta}_{0},\hat{\eta}_{0}) = 0 \end{align*} \]

    Also recall the standard PLR score function, which may be further expanded:

    \[ \begin{align*} \psi(\mathbf{S};\theta,\eta) := [\mathbf{Y} - \ell(\mathbf{X}) - \theta(\mathbf{D} - m(\mathbf{X}))][\mathbf{D} - m(\mathbf{X})]\ \\ = -\theta(\mathbf{D} - m(\mathbf{X}))^{2} + (\mathbf{Y} - \ell(\mathbf{X}))(\mathbf{D} - m(\mathbf{X})) \end{align*} \]

    Solving the moment-condition estimator gives:91

  • 91 The distribution of the mean operator \(\left(\frac{1}{n}\sum_{i \in I}\right)\) and the constant (\(-\hat{\theta}\)) follows from the linearity of expectation.

  • \[ \begin{align*} 0 = \frac{1}{n}\sum_{i \in I}[-\hat{\theta}_{0}(\mathbf{D} - \hat{m}(\mathbf{X}))^{2} + (\mathbf{Y} - \hat{\ell}(\mathbf{X}))(\mathbf{D} - \hat{m}(\mathbf{X}))] \\ \Longrightarrow 0 = -\hat{\theta}_{0}\frac{1}{n}\sum_{i \in I}(\mathbf{D} - \hat{m}(\mathbf{X}))^{2} + \frac{1}{n}\sum_{i \in I}(\mathbf{Y} - \hat{\ell}(\mathbf{X}))(\mathbf{D} - \hat{m}(\mathbf{X})) \\ \Longrightarrow \hat{\theta}_{0}\frac{1}{n}\sum_{i \in I}(\mathbf{D} - \hat{m}(\mathbf{X}))^{2} = \frac{1}{n}\sum_{i \in I}(\mathbf{Y} - \hat{\ell}(\mathbf{X}))(\mathbf{D} - \hat{m}(\mathbf{X})) \\ \Longrightarrow \hat{\theta}_{0} = \frac{\sum_{i \in I}(\mathbf{Y} - \hat{\ell}(\mathbf{X}))(\mathbf{D} - \hat{m}(\mathbf{X}))}{\sum_{i \in I} (\mathbf{D} - \hat{m}(\mathbf{X}))^{2}} \end{align*} \]

    Where:

    \[ \begin{align*} \mathbf{\hat{Q}} = \mathbf{Y} - \hat{\ell}_{0}(\mathbf{X}) \\ \mathbf{\hat{V}} = \mathbf{D} - \hat{m}_{0}(\mathbf{X}) \end{align*} \]

    And where \(\hat{\ell}_{0}\) and \(\hat{m}_{0}\) are plug-in estimators of the conditional expectations:

    \[ \begin{align*} \ell_{0}(\mathbf{X}) = \mathbb{E}[\mathbf{Y}|\mathbf{X}] \\ m_{0}(\mathbf{X}) = \mathbb{E}[\mathbf{D}|\mathbf{X}] \end{align*} \]

    Now, consider a simple linear regression, wherein OLS predicts an outcome \(\mathbf{Y}\) from a single treatment \(\mathbf{D}\).92 The true coefficient on the explanator may be expressed as:93

  • 92 To wit: \(\mathbf{Y} = \beta_{0} + \beta_{1}\mathbf{D} + \epsilon_{i}\)

  • 93 That is, \(\beta_{1} = \frac{\text{Cov}(\mathbf{D}, \mathbf{Y})}{\text{Var}(\mathbf{D})}\). In the bivariate case, \(\mathbb{E}[\mathbf{D}] = \bar{\mathbf{D}}\) and \(\mathbb{E}[\mathbf{Y}] = \bar{\mathbf{Y}}\). The basic formula may be found in Rice (2007, p. 544), for example, albeit in scalar notation.

  • \[ \begin{align*} \beta_{1} = \frac{\sum_{i \in I}(\mathbf{Y} - \mathbb{E}[\mathbf{Y}])(\mathbf{D} - \mathbb{E}[\mathbf{D}])}{\sum_{i \in I} (\mathbf{D} - \mathbb{E}[\mathbf{D}])^{2}} \end{align*} \]

    Or, in the multivariate case, with covariates \(\mathbf{X}\), regressing residualized \(\mathbf{Y}\) on residualized \(\mathbf{D}\) implies (as per the FWL Theorem):

    \[ \begin{align*} \beta_{1} = \frac{\sum_{i \in I}(\mathbf{Y} - \mathbb{E}[\mathbf{Y|X}])(\mathbf{D} - \mathbb{E}[\mathbf{D|X}])}{\sum_{i \in I} (\mathbf{D} - \mathbb{E}[\mathbf{D|X}])^{2}} \\ = \frac{\sum_{i \in I}(\mathbf{Y} - \ell_{0}(\mathbf{X}))(\mathbf{D} - m_{0}(\mathbf{X}))}{\sum_{i \in I} (\mathbf{D} - m_{0}(\mathbf{X}))^{2}} \end{align*} \]

    The solutions to \(\hat{\theta}_{0}\) and the sample analogue \(\hat{\beta}_{1}\) of \(\beta_{1}\) are thus the selfsame, sufficing to illustrate how partialling out—that is, regressing \(\mathbf{\hat{Q}}\) on \(\mathbf{\hat{V}}\) to obtain \(\hat{\theta}_{0}\) using OLS—implies the standard PLR score function.

    8.6 Appendix F: Model Runtimes